CANCORR

Overview

The CANCORR function performs Canonical Correlation Analysis (CCA), a multivariate statistical technique that identifies and measures the linear relationships between two sets of variables. First introduced by Harold Hotelling in 1936, CCA finds linear combinations of each variable set that maximize the correlation between them, producing canonical variates — pairs of composite variables with the highest possible correlation.

Given two sets of variables X = (x_1, \ldots, x_p) and Y = (y_1, \ldots, y_q), CCA seeks weight vectors a and b such that the correlation between U = a^T X and V = b^T Y is maximized. Each subsequent pair of canonical variates is derived under the constraint that it is uncorrelated with all previous pairs. The number of canonical correlations equals \min(p, q).
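The definition above can be sketched directly in NumPy. This is a minimal illustration, not the statsmodels implementation: it assumes both inputs have full column rank and more observations than variables, and the function name canonical_correlations is hypothetical. It centers each variable set, builds an orthonormal basis for each with a QR decomposition, and reads the canonical correlations off the singular values of the product of the two bases.

```python
import numpy as np

def canonical_correlations(x, y):
    """Return the canonical correlations between the columns of x and y.

    Rows are observations, columns are variables. Assumes full column rank.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Center each variable; correlations are unaffected by column means.
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)

    # Orthonormal bases for the column spaces of the centered data.
    qx, _ = np.linalg.qr(xc)
    qy, _ = np.linalg.qr(yc)

    # Singular values of qx.T @ qy are the canonical correlations,
    # returned in descending order; clip to absorb round-off.
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

# One variable per set: the single canonical correlation is |Pearson r|.
print(canonical_correlations([[1], [2], [3], [4]], [[2], [1], [4], [3]]))
```

The length of the returned array is the smaller of the two column counts, matching the number of canonical correlations stated above.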

This implementation uses the statsmodels CanCorr class, which computes the canonical correlations by singular value decomposition (SVD). This is mathematically equivalent to solving the classical eigenvalue problem on the cross-covariance structure of the two variable sets. The function returns:

Canonical correlations: Values ranging from 0 to 1 indicating the strength of each canonical relationship
Eigenvalues: Computed from canonical correlations as \lambda = r^2 / (1 - r^2)
Wilks’ lambda: A multivariate test statistic; the value reported for the i-th canonical variate tests the null hypothesis that the i-th and all later canonical correlations are zero
Chi-square statistics: Bartlett’s approximation for hypothesis testing:

\chi^2 = -\left(n - 1 - \frac{p + q + 1}{2}\right) \ln(\Lambda)

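where n is the number of observations, p and q are the numbers of variables in each set, and \Lambda is Wilks’ lambda. Each result row also includes the degrees of freedom and p-value reported by the statsmodels corr_test method (the p-value comes from its F approximation rather than from the chi-square statistic), and the output ends with the canonical coefficients for each variable set.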
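As a worked illustration of the formulas above, the following sketch derives the eigenvalue, Wilks' lambda, and Bartlett chi-square for each canonical correlation using only the standard library. It assumes the usual definition of Wilks' lambda for the i-th row, the product of (1 - r_j^2) over the i-th and all later correlations, which the text does not spell out. The function name cca_summary_stats and the input values are illustrative only.

```python
import math

def cca_summary_stats(correlations, n_obs, p, q):
    """Return (eigenvalue, wilks_lambda, chi_square) for each correlation.

    correlations: canonical correlations in descending order
    n_obs: number of observations; p, q: variables in each set
    """
    # Bartlett's multiplier: n - 1 - (p + q + 1) / 2
    factor = n_obs - 1.0 - (p + q + 1.0) / 2.0
    rows = []
    for i, r in enumerate(correlations):
        r2 = r * r
        # lambda = r^2 / (1 - r^2); infinite when r is exactly 1
        eigenvalue = r2 / (1.0 - r2) if r2 < 1.0 else math.inf
        # Assumed definition: product of (1 - r_j^2) for j >= i
        wilks = math.prod(1.0 - rj * rj for rj in correlations[i:])
        chi_square = -factor * math.log(wilks) if wilks > 0 else math.inf
        rows.append((eigenvalue, wilks, chi_square))
    return rows

# Illustrative inputs, not results from the examples on this page.
for row in cca_summary_stats([0.8, 0.5], n_obs=20, p=2, q=2):
    print(row)
```

For the first row of this illustrative input, the eigenvalue is 0.64 / 0.36 and Wilks' lambda is 0.36 × 0.75 = 0.27.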

CCA is widely used in psychology, ecology, marketing research, and bioinformatics to explore relationships between measurement batteries, such as comparing personality inventories or linking gene expression data to phenotypic outcomes. For additional background, see the Wikipedia article on Canonical Correlation and the statsmodels multivariate documentation.

This example function is provided as-is without any representation of accuracy.

Excel Usage

=CANCORR(x_vars, y_vars, standardize)

x_vars (list[list], required): First set of variables where rows are observations and columns are variables.
y_vars (list[list], required): Second set of variables where rows are observations and columns are variables.
standardize (bool, optional, default: true): Whether to standardize variables (mean=0, std=1) before analysis.

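Returns (list[list]): 2D list with a header row and one row per canonical variate (correlation, eigenvalue, Wilks' lambda, chi-square, df, p-value), followed by X coefficient and Y coefficient sections separated by blank rows; or an error message string.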
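When the function is called from Python rather than Excel, the returned table can be split back into its parts. The sketch below assumes the layout produced by the Python code on this page: a statistics block, then "X Coefficients" and "Y Coefficients" blocks, each separated by a row of empty strings. The helper names and the numbers in the sample table are placeholders invented for illustration, not real CANCORR output.

```python
def split_result_table(table):
    """Split a result table into sections separated by all-blank rows."""
    sections, current = [], []
    for row in table:
        if all(cell == '' for cell in row):
            # A blank row closes the current section.
            if current:
                sections.append(current)
            current = []
        else:
            current.append(row)
    if current:
        sections.append(current)
    return sections

def stats_as_dicts(stats_section):
    """Turn the statistics section (header + rows) into a list of dicts."""
    header, *rows = stats_section
    return [dict(zip(header, row)) for row in rows]

blank = [''] * 7
# Placeholder table: the values are made up to show the layout only.
sample = [
    ['canonical_variate', 'correlation', 'eigenvalue', 'wilks_lambda',
     'chi_square', 'df', 'p_value'],
    [1, 0.9, 4.26, 0.16, 10.0, 4.0, 0.04],
    [2, 0.4, 0.19, 0.84, 1.0, 1.0, 0.3],
    blank,
    ['X Coefficients'] + [''] * 6,
    ['Variable', 'CV1', 'CV2', '', '', '', ''],
    ['X1', 0.5, -0.2, '', '', '', ''],
    ['X2', 0.1, 0.7, '', '', '', ''],
    blank,
    ['Y Coefficients'] + [''] * 6,
    ['Variable', 'CV1', 'CV2', '', '', '', ''],
    ['Y1', 0.3, 0.6, '', '', '', ''],
    ['Y2', 0.4, -0.5, '', '', '', ''],
]

stats, x_coef, y_coef = split_result_table(sample)
for record in stats_as_dicts(stats):
    print(record['canonical_variate'], record['correlation'])
```

Because an error is returned as a plain string rather than a table, a caller would check isinstance(result, str) before splitting.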

Examples

Example 1: Demo case 1

Inputs:

x_vars		y_vars
1	2.5	2.1	1.5
2	3.2	3	2.8
3	4.1	4.2	3.1
4	5.3	5.1	4.5
5	6	6	5.2

Excel formula:

=CANCORR({1,2.5;2,3.2;3,4.1;4,5.3;5,6}, {2.1,1.5;3,2.8;4.2,3.1;5.1,4.5;6,5.2})

Expected output:

"non-error"

Example 2: Demo case 2

Inputs:

x_vars			y_vars			standardize
1.2	2.8	1.9	3.4	2.1	1.6	true
2.3	3.5	2.4	4.2	3.3	2.5
3.1	4.2	3.7	5.3	4.5	3.2
4.5	5.1	4.2	6.2	5.2	4.7
5.3	6.4	5.6	7.1	6.5	5.3
6.7	7.3	6.1	8.3	7.2	6.8
7.2	8.1	7.4	9.1	8.4	7.2

Excel formula:

=CANCORR({1.2,2.8,1.9;2.3,3.5,2.4;3.1,4.2,3.7;4.5,5.1,4.2;5.3,6.4,5.6;6.7,7.3,6.1;7.2,8.1,7.4}, {3.4,2.1,1.6;4.2,3.3,2.5;5.3,4.5,3.2;6.2,5.2,4.7;7.1,6.5,5.3;8.3,7.2,6.8;9.1,8.4,7.2}, TRUE)

Expected output:

"non-error"

Example 3: Demo case 3

Inputs:

x_vars		y_vars
1	1.5	1.8	2.1
2	2.2	2.7	3.3
3	3.8	3.5	4.2
4	4.5	4.6	5.1
5	5.7	5.4	6.5
6	6.3	6.9	7.2
7	7.9	7.3	8.4

Excel formula:

=CANCORR({1,1.5;2,2.2;3,3.8;4,4.5;5,5.7;6,6.3;7,7.9}, {1.8,2.1;2.7,3.3;3.5,4.2;4.6,5.1;5.4,6.5;6.9,7.2;7.3,8.4})

Expected output:

"non-error"

Example 4: Demo case 4

Inputs:

x_vars		y_vars		standardize
1	2.5	2.1	1.5	false
2	3.2	3	2.8
3	4.1	4.2	3.1
4	5.3	5.1	4.5
5	6	6	5.2

Excel formula:

=CANCORR({1,2.5;2,3.2;3,4.1;4,5.3;5,6}, {2.1,1.5;3,2.8;4.2,3.1;5.1,4.5;6,5.2}, FALSE)

Expected output:

"non-error"

Python Code

import math
from statsmodels.multivariate.cancorr import CanCorr as statsmodels_cancorr

def cancorr(x_vars, y_vars, standardize=True):
    """
    Performs Canonical Correlation Analysis (CCA) between two sets of variables.

    See: https://www.statsmodels.org/stable/generated/statsmodels.multivariate.cancorr.CanCorr.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        x_vars (list[list]): First set of variables where rows are observations and columns are variables.
        y_vars (list[list]): Second set of variables where rows are observations and columns are variables.
        standardize (bool, optional): Whether to standardize variables (mean=0, std=1) before analysis. Default is True.

    Returns:
        list[list]: 2D table with a header row, one statistics row per canonical variate,
            and X and Y coefficient sections; or an error message string.
    """
    def to2d(x):
        return [[x]] if not isinstance(x, list) else x

    def validate_2d_array(arr, name):
        # Validate that arr is a 2D list of numeric values
        if not isinstance(arr, list):
            return f"Invalid input: {name} must be a 2D list."
        if len(arr) == 0:
            return f"Invalid input: {name} must not be empty."
        for i, row in enumerate(arr):
            if not isinstance(row, list):
                return f"Invalid input: {name} must be a 2D list."
            if len(row) == 0:
                return f"Invalid input: {name} rows must not be empty."
            for j, val in enumerate(row):
                if not isinstance(val, (int, float, bool)):
                    return f"Invalid input: {name}[{i}][{j}] must be numeric."
                num_val = float(val)
                if math.isnan(num_val) or math.isinf(num_val):
                    return f"Invalid input: {name}[{i}][{j}] must be finite."
        # Check that all rows have the same length
        row_lengths = [len(row) for row in arr]
        if len(set(row_lengths)) > 1:
            return f"Invalid input: {name} must have consistent row lengths."
        return None

    # Normalize inputs
    x_vars = to2d(x_vars)
    y_vars = to2d(y_vars)

    # Validate inputs
    error = validate_2d_array(x_vars, "x_vars")
    if error:
        return error
    error = validate_2d_array(y_vars, "y_vars")
    if error:
        return error

    # Validate standardize
    if not isinstance(standardize, bool):
        return "Invalid input: standardize must be a boolean."

    # Check that x_vars and y_vars have the same number of rows
    if len(x_vars) != len(y_vars):
        return "Invalid input: x_vars and y_vars must have the same number of observations (rows)."

    # Check minimum number of observations
    n_obs = len(x_vars)
    n_x_vars = len(x_vars[0])
    n_y_vars = len(y_vars[0])

    if n_obs < max(n_x_vars, n_y_vars) + 1:
        return "Invalid input: number of observations must be greater than the number of variables."

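    try:
        # Perform canonical correlation analysis.
        # CanCorr takes (endog, exog) and exposes x_cancoef for exog and
        # y_cancoef for endog, so y_vars is passed first to keep the
        # X/Y coefficient sections below matched to the right inputs.
        cca = statsmodels_cancorr(y_vars, x_vars, standardize=standardize)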

        # Get test results
        corr_test = cca.corr_test()

        # Build output table
        output = []

        # Header row
        output.append([
            'canonical_variate',
            'correlation',
            'eigenvalue',
            'wilks_lambda',
            'chi_square',
            'df',
            'p_value'
        ])

        # Results for each canonical correlation
        n_cv = len(cca.cancorr)
        for i in range(n_cv):
            # Calculate eigenvalue from canonical correlation
            r = float(cca.cancorr[i])
            eigenval = (r * r) / (1.0 - r * r) if r < 1.0 else float('inf')

            # Get Wilks' lambda from test results
            wilks = float(corr_test.stats.loc[i, "Wilks' lambda"])

            # Calculate chi-square using Bartlett's approximation
            chi_sq = -(n_obs - 1.0 - (n_x_vars + n_y_vars + 1.0) / 2.0) * math.log(wilks) if wilks > 0 else float('inf')

            # Get degrees of freedom and p-value from statsmodels' F approximation
            df = float(corr_test.stats.loc[i, 'Num DF'])
            pval = float(corr_test.stats.loc[i, 'Pr > F'])

            row = [
                i + 1,  # canonical variate number
                r,  # canonical correlation
                eigenval,  # eigenvalue
                wilks,  # Wilks' lambda
                chi_sq,  # chi-square
                df,  # degrees of freedom
                pval  # p-value
            ]
            output.append(row)

        # Add blank row separator
        output.append([''] * 7)

        # Add X coefficients section
        output.append(['X Coefficients'] + [''] * 6)
        x_coef_header = ['Variable'] + [f'CV{j+1}' for j in range(n_cv)] + [''] * (7 - n_cv - 1)
        output.append(x_coef_header[:7])

        for i in range(n_x_vars):
            row = [f'X{i+1}'] + [float(cca.x_cancoef[i, j]) for j in range(n_cv)] + [''] * (7 - n_cv - 1)
            output.append(row[:7])

        # Add blank row separator
        output.append([''] * 7)

        # Add Y coefficients section
        output.append(['Y Coefficients'] + [''] * 6)
        y_coef_header = ['Variable'] + [f'CV{j+1}' for j in range(n_cv)] + [''] * (7 - n_cv - 1)
        output.append(y_coef_header[:7])

        for i in range(n_y_vars):
            row = [f'Y{i+1}'] + [float(cca.y_cancoef[i, j]) for j in range(n_cv)] + [''] * (7 - n_cv - 1)
            output.append(row[:7])

        return output

    except Exception as e:
        # ValueError and any other failure are reported the same way
        return f"Calculation error: {e}"
